Abstract

We consider a multi-armed bandit problem 
where the decision maker can explore and ex- 
ploit different arms at every round. The ex- 
ploited arm adds to the decision maker's cu- 
mulative reward (without necessarily observ- 
ing the reward) while the explored arm re- 
veals its value. We devise algorithms for this 
setup and show that the dependence on the 
number of arms, k, can be much better than 
the standard $\sqrt{k}$ dependence, depending on
the behavior of the arms' reward sequences. 
For the important case of piecewise station- 
ary stochastic bandits, we show a significant 
improvement over existing algorithms. Our 
algorithms are based on a non-uniform sam- 
pling policy, which we show is essential to the 
success of any algorithm in the adversarial 
setup. Finally, we show some simulation re- 
sults on an ultra-wide band channel selection 
inspired setting indicating the applicability of 
our algorithms. 

1. Introduction 

Multi-armed bandits have long been a canonical frame- 
work for studying online learning under partial infor- 
mation constraints. In this framework, a learner has 
to repeatedly obtain rewards by choosing from a fixed 
set of k actions (arms), and gets to see only the re- 
ward of the chosen action. The goal of the learner is 
to minimize regret, namely the difference between her 
own cumulative reward and the cumulative reward of 
the best single action in hindsight. We focus here on 
algorithms suited for adversarial settings, which have
reasonable regret even without any stochastic
assumptions on the reward generating process.

A central theme in multi-armed bandits is the 
exploration-exploitation tradeoff: The learner must
choose highly-rewarding actions most of the time in 
order to minimize regret, but also needs to do some 
exploration in order to determine which actions to 
choose. Ultimately, the tradeoff comes from the as- 
sumption that the learner is constrained to observe 
only the reward of the action she picked. 

While being a compelling and widely applicable frame- 
work, there exist several realistic bandit-like settings, 
which do not correspond to this fundamental assump- 
tion. For example, in ultra-wide band (UWB) com- 
munications, the decision maker, also called the "sec- 
ondary," has to decide in which channel to transmit 
and in what way. There are typically many possible 
channels (i.e., frequency bands) and several transmis- 
sion methods (power, code used, modulation, etc.; see 
(Oppermann et al., 2004)). In some UWB devices, the 
secondary can sense a different channel (or channels) 
than the one it currently uses for transmission. In 
fact, in some settings, the secondary cannot sense the 
channel it is currently transmitting in because of in- 
terference. The UWB environment is extremely noisy 
since it potentially contains many other sources, called 
"primaries." Some of these sources are sources whose 
behavior (which channel they use, for how long, and 
in which power level) can be very hard to predict as 
they represent a mobile device using WiMAX, WiFi
or some other communication protocol. It is therefore 
sensible to model the behavior of primaries as an ad- 
versarial process or a piecewise stationary process. We 
should mention that UWB networks are highly com- 
plex, with many issues such as power constraints and 
multi-agency that have been considered in the multi- 
armed bandit framework (Liu & Zhao, 2010; Avner & 
Mannor, 2011; Lai et al., 2008), but the decoupling of
sensing and transmission has not been considered to
the best of our knowledge. More abstractly, our work 
relates to any bandit-like setting, where we are free 
to query the environment for some additional partial 
information, irrespective of our actual actions. 

In such settings, the assumption that the learner can 
only observe the reward of the action she picked is 
an unnecessary constraint, and one might hope that 
removing this constraint and constructing suitable al- 
gorithms would allow better performance. We empha- 
size that this is far from obvious: In this paper, we 
will mostly focus on the case where the learner may 
query just a single action, so in some sense the learner 
gets the same "amount of information" per round as 
the standard bandit setting (i.e., the reward of a single 
action out of k actions overall). The goal of this paper
is to devise algorithms for this setting, and analyze 
theoretically and empirically whether the hope for im- 
proved performance is indeed justified. We emphasize 
that our results and techniques naturally generalize to 
cases where more than one action can be queried, and 
cases where the reward of the selected action is always 
revealed (see Sec. 7). 

Specifically, our contributions are the following: 

• We present a "decoupled" multi-armed bandit al- 
gorithm, which is suited to our setting. The algo- 
rithm is based on a certain querying distribution, 
which is adaptive and depends on the distribution 
by which the actions are actually picked. We show 
a "data-dependent" regret guarantee for the algo- 
rithm, which is never worse than that of standard 
bandit algorithms, and can be much better (in 
terms of dependence on the number of actions k), 
depending on how the actions' rewards behave. 

• We prove that in certain settings (in particular, 
piecewise stochastic rewards), the decoupling as- 
sumption allows us to devise algorithms with sig- 
nificantly better performance than any possible 
standard bandit algorithm. 

• Our algorithms are based on a certain adap- 
tive querying distribution, in contrast to previous 
works in the stochastic case where the querying 
distribution was uniform. We show that in some 
sense, such an adaptive policy is necessary in an 
adversarial setting, in order to get performance 
improvements compared to standard bandit algo- 
rithms. 

• We perform a preliminary experimental study,
corroborating our theoretical findings and indicating
that our algorithmic approach indeed leads
to improved results, compared to standard approaches.

The proofs of our theorems are provided in the ap- 
pendix of the full version (Avner et al., 2012). 

Related Work. The idea of decoupling exploration 
and exploitation has appeared in a few previous works, 
but in different settings and contexts. For example, 
(Yu & Mannor, 2009) discuss a setting where the 
learner is allowed to query an additional action in a 
multi-armed bandit setting, but the focus there was 
on algorithms for stochastic bandits, as opposed to 
adversarial bandits as we do here. (Agarwal et al., 
2010) study a bandit setting with (one or more) queries 
per round. However, they focus on the problem of 
bandit convex optimization, which is much more general
than ours, and exploration and exploitation remain
coupled in their framework. A different line
of work (Even-Dar et al., 2006; Audibert et al., 2010;
Bubeck et al., 2011) considers multi-armed bandits in
a stochastic setting, where the goal is to identify the 
best action by performing pure exploration. While this 
work also conceptually "decouples" exploration and 
exploitation, the goal and setting are quite different 
than ours. 

2. Problem Setting 

We use [k] as shorthand for {1, . . . , k}. Bold-face letters
represent vectors, and $\mathbf{1}_A$ represents the indicator
function for an event A. We use the standard big-Oh
notation $O(\cdot)$ to hide constants, and $\tilde{O}(\cdot)$ to hide
constants and logarithmic factors. For a distribution
vector $\mathbf{p}$ on the k-simplex, we use the notation

$$\|\mathbf{p}\|_{1/2} = \Big(\sum_{j=1}^{k} \sqrt{p_j}\Big)^2$$

to describe the "$\ell_{1/2}$-norm" of the distribution. It is
straightforward to show that for a distribution vector,
this quantity is always in $[1, k]$. In particular, it is k
for the uniform distribution, and gets smaller the more
non-uniform the distribution is, attaining the value of
1 when $\mathbf{p}$ is a unit vector.
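
As a quick numerical illustration of this quantity, the following minimal sketch evaluates it on a uniform, a skewed, and a unit distribution. It assumes the quantity is the squared sum of square roots of the entries, which matches the stated range $[1, k]$; the function name `half_norm` is ours and purely illustrative.

```python
import math


def half_norm(p):
    """Return (sum_j sqrt(p_j))**2 for a probability vector p."""
    if any(x < 0 for x in p) or abs(sum(p) - 1.0) > 1e-9:
        raise ValueError("p must be a probability vector")
    return sum(math.sqrt(x) for x in p) ** 2


k = 4
uniform = [1.0 / k] * k            # maximally spread out: value k
skewed = [0.85, 0.05, 0.05, 0.05]  # mostly concentrated: value between 1 and k
unit = [1.0, 0.0, 0.0, 0.0]        # fully concentrated: value 1

for name, p in [("uniform", uniform), ("skewed", skewed), ("unit", unit)]:
    print(name, half_norm(p))
```

The more mass the distribution puts on a few actions, the closer the value gets to 1.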

Our setting is a variant of the standard adversarial 
multi-armed bandit framework, focusing (for simplic- 
ity) on an oblivious adversary and a fixed horizon. In 
this setting, we have a fixed set of k > 1 actions and
a fixed known number of rounds T. Each action i
at each round t has an unknown associated reward
$g_i(t) \in [0, 1]$. At each round, a learner chooses one
of the actions $i_t$, and obtains the associated reward
$g_{i_t}(t)$. The basic goal in this setting is to minimize the
regret with respect to the best single action in hindsight,
namely

$$\max_{i \in [k]} \sum_{t=1}^{T} g_i(t) - \sum_{t=1}^{T} g_{i_t}(t).$$
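
To make the regret definition concrete, here is a minimal sketch that computes it for a given reward table and a given sequence of chosen actions. The names `regret`, `rewards` and `chosen` are illustrative and not taken from the paper; actions are indexed from 0.

```python
def regret(rewards, chosen):
    """Regret against the best single action in hindsight.

    rewards[t][i] plays the role of g_i(t); chosen[t] plays the role of i_t.
    """
    T = len(rewards)
    k = len(rewards[0])
    best_fixed = max(sum(rewards[t][i] for t in range(T)) for i in range(k))
    obtained = sum(rewards[t][chosen[t]] for t in range(T))
    return best_fixed - obtained


# Two actions, three rounds: action 0 is best overall (total reward 2).
rewards = [[1, 0], [1, 0], [0, 1]]
print(regret(rewards, [1, 1, 1]))  # always picking action 1
print(regret(rewards, [0, 0, 1]))  # switching can even beat the best fixed action
```

Note that the regret can be negative, since the comparator is restricted to a single fixed action.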



Unless specified otherwise, we make no assumptions 
on how the rewards $g_i(t)$ are generated (other than
boundedness), and they might even be generated
adversarially by an agent with full knowledge of our
algorithm. However, we assume that the rewards are
fixed in advance and do not depend on the learner's 
(possibly random) choices in previous rounds. 

In standard multi-armed bandits, at the end of each
round, the learner only gets to know the reward $g_{i_t}(t)$
of the action $i_t$ which was actually picked, but not
the reward of other actions. Instead, in this paper
we focus on a different setting, where the learner, after
choosing an action $i_t$, may query a single action $j_t$
and get to see its associated reward $g_{j_t}(t)$. This setting
is a (slight) relaxation of the standard bandit setting,
since we can always query $j_t = i_t$. However, here it is
possible to query an action different than $i_t$. We
emphasize that the regret is still measured with respect
to the chosen actions $i_t$, and the querying only has
informational value. In order to compare our results
with those obtainable in the standard setting, we will
use the term standard bandit algorithm to refer to
algorithms which are not free to query rewards, and are
limited to receiving the reward of the chosen action.
A typical example is EXP3.P (Auer et al., 2002),
with an $\tilde{O}(\sqrt{kT})$ regret upper bound, holding with high
probability, or the Implicitly Normalized Forecaster of
(Audibert & Bubeck, 2009) with $O(\sqrt{kT})$ regret.

An interesting variant of our setting is when the
learner gets to query more than one action, or gets
to see $g_{i_t}(t)$ on top of $g_{j_t}(t)$. Such variants are further
discussed in Sec. 7.

3. Basic Algorithm and Results 

In analyzing our "decoupled" setting, perhaps the first
question one might ask is whether one can always get
improved regret performance, compared to the standard
bandit setting. Namely, that for any reward assignment,
the attainable regret will always be significantly
smaller. Unfortunately, this is not the case: It
can be shown that there exists an adversarial strategy
such that the regret of standard bandit algorithms is
$\Omega(\sqrt{kT})$, whereas the regret of any "decoupled"
algorithm will be $\Omega(\sqrt{kT})$ (footnote 1). Therefore, one
cannot hope to always obtain better performance. However,
as we will soon show, this can be obtained under certain
realistic conditions on the actions' rewards.

1 One simply needs to consider the strategy used to obtain
the $\Omega(\sqrt{kT})$ regret lower bound in the standard bandit
setting (Auer et al., 2002). The lower bound proof can be
shown to apply to a "decoupled" algorithm as well.
Intuitively, this is because the hardness for the learner stems
from distinguishing slightly different distributions based on
at most T samples, which has nothing to do with the
coupling constraint.

We now turn to present our first algorithm (Algorithm 1
below) and the associated regret analysis.
The algorithm is rather similar in structure to standard
bandit algorithms, picking actions at random in
each round t according to a weighted distribution $\mathbf{p}(t)$
which is updated multiplicatively. The main difference
is in determining how to query the reward. Here,
the queried action is picked at random, according to a
query distribution $\mathbf{q}(t)$ which is based on but not
identical to $\mathbf{p}(t)$. More particularly, the queried action $j_t$
is chosen with probability

$$q_{j_t}(t) = \frac{\sqrt{p_{j_t}(t)}}{\sum_{l=1}^{k} \sqrt{p_l(t)}}. \qquad (1)$$

Roughly speaking, this distribution can be seen as a
"geometric average" between $\mathbf{p}(t)$ and a uniform
distribution over the k actions. See Algorithm 1 for the
precise pseudocode.

Algorithm 1 Decoupled MAB Algorithm

Input: Step size parameter $\mu \in [1, k]$, confidence
parameter $\delta \in (0, 1)$
Let $\eta = 1/\sqrt{\mu T}$, $\beta = 2\eta\sqrt{6\log(3k/\delta)}$ and
$\gamma = \eta^2 (1 + \beta)^2 k^2$
$\forall j \in [k]$ let $w_j(1) = 1$.
for t = 1,...,T do
    $\forall j \in [k]$, let $p_j(t) = (1 - \gamma)\frac{w_j(t)}{\sum_{l=1}^{k} w_l(t)} + \frac{\gamma}{k}$
    Choose action $i_t$ with probability $p_{i_t}(t)$
    Query reward $g_{j_t}(t)$ with probability $q_{j_t}(t) = \frac{\sqrt{p_{j_t}(t)}}{\sum_{l=1}^{k}\sqrt{p_l(t)}}$
    $\forall j \in [k]$, let $\tilde{g}_j(t) = \frac{1}{q_j(t)}\left(g_j(t)\mathbf{1}_{j_t = j} + \beta\right)$
    $\forall j \in [k]$, let $w_j(t+1) = w_j(t)\exp(\eta\,\tilde{g}_j(t))$
end for

Readers familiar with bandit algorithms might notice
the existence of the common "exploration component"
$\gamma/k$ in the definition of $p_j(t)$. In standard bandit
algorithms, this is used to force the algorithm to explore
all arms to some extent. In our setting, exploration
is performed via the separate query distribution $q_j(t)$,
and in fact, this $\gamma/k$ term can be inserted into the
$q_j(t)$ definition instead. While this would be more
aesthetically pleasing, it also seems to make our proofs
and results more complicated, without substantially
improving performance. Therefore, we will stick with
this formulation.
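
The pseudocode of Algorithm 1 can be turned into a short runnable sketch along the following lines. This is an illustration rather than the authors' implementation, and it rests on several assumptions of ours: the step size is taken to be $\eta = 1/\sqrt{\mu T}$, the exploration weight $\gamma$ is clipped at 1 so that $\mathbf{p}(t)$ remains a distribution for small T, and the weights are kept in the log domain for numerical stability. The function name `decoupled_mab` and the 0-based action indices are also ours.

```python
import math
import random


def decoupled_mab(rewards, mu, delta, seed=0):
    """Sketch of the decoupled algorithm; rewards[t][j] is g_j(t) in [0, 1].

    Returns the chosen actions i_t and the queried actions j_t (0-based).
    """
    rng = random.Random(seed)
    T, k = len(rewards), len(rewards[0])
    eta = 1.0 / math.sqrt(mu * T)  # assumed step size
    beta = 2.0 * eta * math.sqrt(6.0 * math.log(3.0 * k / delta))
    gamma = min(1.0, (eta * (1.0 + beta) * k) ** 2)  # clipping is our assumption
    log_w = [0.0] * k  # log of w_j(t), starting from w_j(1) = 1
    chosen, queried = [], []
    for t in range(T):
        top = max(log_w)
        w = [math.exp(lw - top) for lw in log_w]  # rescaled weights
        total = sum(w)
        p = [(1.0 - gamma) * wj / total + gamma / k for wj in w]
        roots = [math.sqrt(pj) for pj in p]
        z = sum(roots)
        q = [r / z for r in roots]  # query distribution, proportional to sqrt(p)
        i_t = rng.choices(range(k), weights=p)[0]  # exploited action
        j_t = rng.choices(range(k), weights=q)[0]  # explored (queried) action
        for j in range(k):
            seen = rewards[t][j] if j == j_t else 0.0  # only g_{j_t}(t) is observed
            log_w[j] += eta * (seen + beta) / q[j]
        chosen.append(i_t)
        queried.append(j_t)
    return chosen, queried
```

For example, on a toy instance where one action always pays 1 and the others pay 0, calling `decoupled_mab(rewards, mu=1.0, delta=0.05)` with a horizon of a few tens of thousands of rounds should select the good action in most rounds while still querying the other actions regularly.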

Before discussing the formal theoretical results, we 
would like to briefly explain the intuition behind this 
querying distribution. Most bandit algorithms (in- 
cluding ours) build upon a standard multiplicative up- 
dates approach, which updates the distribution p(t) 
multiplicatively based on each action's rewards. In 
the bandit setting, we only get partial information on 
the rewards, and therefore resort to multiplicative up- 
dates based on an unbiased estimate of them. The key 
quantity which controls the regret is the variance of 
these estimates, in expectation over the action
distribution $\mathbf{p}(t)$. In our case, this quantity turns out to
be on the order of $\sum_{j=1}^{k} p_j(t)/q_j(t)$. Now, standard
bandit algorithms, which may not query at will, are
essentially constrained to have $q_j(t) = p_j(t)$, leading
to an expected variance of $k$ and hence the $k$ in their
$O(\sqrt{kT})$ regret bound. However, in our case, we are
free to pick the querying distribution $\mathbf{q}(t)$ as we wish.
It is not hard to verify that $\sum_{j=1}^{k} p_j(t)/q_j(t)$ is
minimized by choosing $\mathbf{q}(t)$ as in Eq. (1), with the value of
$\|\mathbf{p}(t)\|_{1/2}$. Thus, roughly speaking, instead of
dependence on $k$, we get a dependence on $\frac{1}{T}\sum_{t=1}^{T}\|\mathbf{p}(t)\|_{1/2}$,
as will be seen shortly.
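
The claim about the variance term can be checked numerically. The sketch below compares the sum of $p_j/q_j$ under three query distributions: $\mathbf{q} = \mathbf{p}$ (the standard bandit constraint), a uniform $\mathbf{q}$, and $\mathbf{q}$ proportional to $\sqrt{p_j}$. The function names and the example vector are ours and only illustrative.

```python
import math


def variance_proxy(p, q):
    """Return sum_j p_j / q_j, skipping actions with p_j = 0."""
    return sum(pj / qj for pj, qj in zip(p, q) if pj > 0)


def sqrt_query(p):
    """Query distribution proportional to sqrt(p_j)."""
    roots = [math.sqrt(pj) for pj in p]
    z = sum(roots)
    return [r / z for r in roots]


p = [0.7, 0.1, 0.1, 0.1]
k = len(p)
coupled = variance_proxy(p, p)                # q = p gives exactly k
uniform = variance_proxy(p, [1.0 / k] * k)    # uniform q also gives k
decoupled = variance_proxy(p, sqrt_query(p))  # equals (sum_j sqrt(p_j))**2
print(coupled, uniform, decoupled)
```

Both the coupled and the uniform choice give the value k for any distribution, while the square-root choice gives the smaller value $(\sum_j \sqrt{p_j})^2$, which is why neither of the first two choices can improve the dependence on k through this term.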

The theoretical analysis of our algorithm relies on the 
following technical quantity: For any algorithm
parameter choices $\mu, \delta$, and for any $v \in [1, k]$, define

$$P(v, \delta, \mu) = \Pr\left(\frac{1}{T}\sum_{t=1}^{T}\|\mathbf{p}(t)\|_{1/2} > v\right),$$

where the probability is over the algorithm's randomness,
run with parameters $\mu, \delta$, with respect to the
(fixed) reward sequence. The formal result we obtain
is the following:

Theorem 1. Suppose that T is sufficiently large (and
thus $\eta$ and $\beta$ sufficiently small) so that $(1 + \beta)^2 \le 2$.
Then for any $v \in [1, k]$, it holds with probability
at least $1 - \delta - P(v, \delta, \mu)$ that the sequence of rewards
$g_{i_1}(1), \ldots, g_{i_T}(T)$ returned by Algorithm 1 satisfies
T T 



t=i 



< O 




v T 



fc 2 

T 3/2 



where the $\tilde{O}$ notation hides numerical constants and
factors logarithmic in $k$ and $\delta$.

At this point, the nature of this result might seem a bit
cryptic. We will soon provide more concrete examples,
but would like to give a brief general intuition. First
of all, if we pick $\mu = v = k$, then $P(v, \delta, \mu) = 0$ always
(as $\|\mathbf{p}(t)\|_{1/2} \le k$), and the bound becomes $\tilde{O}(\sqrt{kT})$,
holding with probability $1 - \delta$, similar to standard
multi-armed bandit guarantees. This shows that our
algorithm's regret guarantee is never worse than that
of standard bandit algorithms. However, the theorem
also implies that under certain conditions, the resulting
bound may be significantly better. For example, if
we run the algorithm with $\mu = 1$ and have $v = O(1)$,
then the bound becomes $\tilde{O}(\sqrt{T})$ for sufficiently large
T. This bound is meaningful only if $P(O(1), \delta, 1)$ is
reasonably small. This would happen if the distribution
vectors $\mathbf{p}(t)$ chosen by the algorithm tend to be
highly non-uniform, since it leads to a small value for
$\frac{1}{T}\sum_{t=1}^{T}\|\mathbf{p}(t)\|_{1/2}$.

We now turn to provide a concrete scenario, where 
the bound we obtain is better than those obtained by 
standard bandit algorithms. Informally, the scenario 
we discuss assumes that although there are k actions, 
where k is possibly large, only a small number of them 
are actually "relevant" and have a performance close 
to that of the best action in hindsight. Intuitively, 
such cases would lead to the distribution vectors p(t) 
to be non-uniform, which is favorable to our analysis. 
Theorem 2. Suppose that the reward of each action
is chosen i.i.d. from a distribution supported on [0, 1].
Furthermore, suppose that there exist a subset $G \subset [k]$
of actions and a parameter $\Delta > 0$ (where $|G|, \Delta$ are
considered constants independent of $k, T$), such that
the expected reward of any action in $G$ is larger than
the expected reward of any action in $[k] \setminus G$ by at least
$\Delta$. Then if we run our algorithm with

$$\mu = k^{\min\{1,\ \max\{0,\ \frac{4}{3} - \frac{1}{3}\log_k(T)\}\}},$$

it holds with probability at least $1 - \delta$ that the regret of
the algorithm is at most

$$\tilde{O}\left(\sqrt{k^{\max\{0,\ \frac{4}{3} - \frac{1}{3}\log_k(T)\}}\; T}\right),$$

where the $\tilde{O}$ notation hides numerical constants and
factors logarithmic in $\delta, k$.

The bound we obtain interpolates between the usual
$\tilde{O}(\sqrt{kT})$ bound obtained using a standard bandit
algorithm, and a considerably better $\tilde{O}(\sqrt{T})$, as T gets
larger compared with k. We note that a mathematically
equivalent form of the bound is

$$\tilde{O}\left(T \max\left\{(k/T)^{2/3},\ (1/T)^{1/2}\right\}\right).$$

Namely, the average per-round regret scales down as
$(k/T)^{2/3}$, until T is sufficiently large and we switch to
a $(1/T)^{1/2}$ regime. In contrast, the bound for standard
bandit algorithms is always of the form $(k/T)^{1/2}$, and
the rate of regret decay is significantly slower.
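
The two forms of the bound can be compared numerically. The sketch below assumes that the exponent in the step size and in the bound reads max{0, 4/3 - (1/3) log_k(T)}, which is the choice consistent with the (k/T)^(2/3) and (1/T)^(1/2) regimes described above; constants and logarithmic factors are ignored, and the function names are ours.

```python
import math


def mu_choice(k, T):
    """Step size parameter k**min(1, max(0, 4/3 - log_k(T)/3)) (assumed form)."""
    exponent = min(1.0, max(0.0, 4.0 / 3.0 - math.log(T, k) / 3.0))
    return k ** exponent


def bound_exponent_form(k, T):
    """sqrt(k**max(0, 4/3 - log_k(T)/3) * T), ignoring constants and logs."""
    exponent = max(0.0, 4.0 / 3.0 - math.log(T, k) / 3.0)
    return math.sqrt(k ** exponent * T)


def bound_max_form(k, T):
    """T * max((k/T)**(2/3), (1/T)**(1/2)), ignoring constants and logs."""
    return T * max((k / T) ** (2.0 / 3.0), (1.0 / T) ** 0.5)


k = 10
for T in (10, 10**3, 10**4, 10**6):
    print(T, mu_choice(k, T), bound_exponent_form(k, T),
          bound_max_form(k, T), math.sqrt(k * T))
```

Under this reading, the two forms coincide for every T at least k, the bound equals the standard $\sqrt{kT}$ at T = k, and it drops to $\sqrt{T}$ once T reaches $k^4$.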

We emphasize that although the setting discussed 
above is a stochastic one (where the rewards are cho- 
sen i.i.d.), our algorithm can cope simultaneously with 
arbitrary rewards, unlike algorithms designed specifi- 
cally for stochastic i.i.d. rewards (which do admit bet- 
ter dependence in T, although not necessarily in k). 

Finally, we note that in practice, the optimal choice of $\mu$
depends on the (unknown) rewards, and hence cannot
be determined by the learner in advance. However,
this can be resolved algorithmically by a standard
doubling trick (cf. (Cesa-Bianchi & Lugosi,
2006)), without materially affecting the regret guarantee.
Roughly speaking, we can guess an upper bound
$v$ on $\frac{1}{T}\sum_{t=1}^{T}\|\mathbf{p}(t)\|_{1/2}$ and pick $\mu = v$, and if the
cumulative sum $\sum_t \|\mathbf{p}(t)\|_{1/2}$ eventually exceeds $Tv$ at
some round, then we double $v$ and $\mu$ and restart the
algorithm.
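
The doubling scheme just described can be sketched as a wrapper around the decoupled learner. Several details below are our assumptions, since the text only gives the rough idea: the step size is taken as $1/\sqrt{\mu T}$, the running sum of the $\ell_{1/2}$-norms is reset at every restart, the guess $v$ is capped at k (the sum can never exceed Tk), and $\gamma$ is clipped at 1. The function name `run_with_doubling` is hypothetical.

```python
import math
import random


def run_with_doubling(rewards, delta, v0=1.0, seed=0):
    """Doubling wrapper; rewards[t][j] is g_j(t) in [0, 1].

    Returns (chosen actions, number of restarts, final guess v).
    """
    rng = random.Random(seed)
    T, k = len(rewards), len(rewards[0])
    v, restarts, t = min(v0, float(k)), 0, 0
    chosen = []
    while t < T:
        mu = v  # the text picks mu equal to the current guess v
        eta = 1.0 / math.sqrt(mu * T)
        beta = 2.0 * eta * math.sqrt(6.0 * math.log(3.0 * k / delta))
        gamma = min(1.0, (eta * (1.0 + beta) * k) ** 2)
        log_w = [0.0] * k  # restart: weights are reset
        cum_norm = 0.0     # assumption: the running sum is reset as well
        while t < T:
            top = max(log_w)
            w = [math.exp(lw - top) for lw in log_w]
            total = sum(w)
            p = [(1.0 - gamma) * wj / total + gamma / k for wj in w]
            roots = [math.sqrt(pj) for pj in p]
            z = sum(roots)
            cum_norm += z * z  # add the half-norm of p(t)
            if cum_norm > T * v and v < k:
                v = min(2.0 * v, float(k))  # double the guess and restart
                restarts += 1
                break
            q = [r / z for r in roots]
            i_t = rng.choices(range(k), weights=p)[0]
            j_t = rng.choices(range(k), weights=q)[0]
            for j in range(k):
                seen = rewards[t][j] if j == j_t else 0.0
                log_w[j] += eta * (seen + beta) / q[j]
            chosen.append(i_t)
            t += 1
    return chosen, restarts, v
```

Because the guess at most doubles up to k, the number of restarts is at most logarithmic in k, which is why the trick costs only a modest factor in the regret.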

4. Decoupling Provably Helps in some 
Adversarial Settings 

So far, we have seen how the bounds obtained for our 
approach are better than the ones known for standard 
bandit algorithms. However, this doesn't imply that 
our approach would indeed yield better performance 
in practice: it might be possible, for instance, that 
for the setting described in Thm. 2, one can provide 
a tighter analysis of standard bandit algorithms, and 
recover a similar result. In this section, we show that 
there are cases where decoupling provably helps, and 
our approach can provide performance provably better 
than any standard bandit algorithm, for information- 
theoretic reasons. We note that the idea of decoupling 
has been shown to be helpful in cases reminiscent of 
the one we will be discussing (Yu & Mannor, 2009), but 
here we study it in the more general and challenging 
adversarial setting. 

Instead of the plain-vanilla multi-armed bandit setting,
we will discuss here a slightly more general setting,
where our goal is not to achieve regret with respect
to the best single action, but rather to the best
sequence of S > 1 actions. More specifically, we wish
to obtain a regret bound of the form

$$\max_{\substack{1 = T_1 \le T_2 \le \cdots \le T_{S+1} = T \\ i_1, \ldots, i_S \in [k]}} \ \sum_{s=1}^{S} \sum_{t=T_s+1}^{T_{s+1}} g_{i_s}(t) - \sum_{t=1}^{T} g_{i_t}(t).$$

This setting is well-known in the online learning literature,
and has been considered for instance in (Herbster
& Warmuth, 1998) for full-information online learning
(under the name of "tracking the best expert") and in
(Auer et al., 2002) for the bandit setting (under the
name of "regret against arbitrary strategies").

This setting is particularly suitable when the best action
changes with time. Intuitively, our decoupling
approach helps here, since we can exploit much more
aggressively while still performing reasonable exploration,
which is important for detecting such changes.

The algorithm we use follows the lead of (Auer et al.,
2002) and is presented as Algorithm 2. The only difference
compared to Algorithm 1 is that the $w_j(t+1)$
parameters are computed differently. This change
facilitates more aggressive exploration.

Algorithm 2 Decoupled MAB Algorithm For Switching

Input: Step size parameter $\mu \in [1, k]$, confidence
parameter $\delta \in (0, 1)$, number of switches S
Let $\eta = \sqrt{S/(\mu T)}$, $\alpha = 1/T$, $\beta = 2\eta\sqrt{6\log(3k/\delta)}$ and
$\gamma = \eta^2 (1 + \beta)^2 k^2$
$\forall j \in [k]$ let $w_j(1) = 1$.
for t = 1,...,T do
    $\forall j \in [k]$, let $p_j(t) = (1 - \gamma)\frac{w_j(t)}{\sum_{l=1}^{k} w_l(t)} + \frac{\gamma}{k}$
    Choose action $i_t$ with probability $p_{i_t}(t)$
    Query reward $g_{j_t}(t)$ with probability $q_{j_t}(t) = \frac{\sqrt{p_{j_t}(t)}}{\sum_{l=1}^{k}\sqrt{p_l(t)}}$
    $\forall j \in [k]$, let $\tilde{g}_j(t) = \frac{1}{q_j(t)}\left(g_j(t)\mathbf{1}_{j_t = j} + \beta\right)$
    $\forall j \in [k]$, let $w_j(t+1) = w_j(t)\exp(\eta\,\tilde{g}_j(t)) + \frac{e\alpha}{k}\sum_{l=1}^{k} w_l(t)$
end for


The following theorem, which is proven along similar 
lines to Thm. 1, shows that in this setting as well, we 
get the same kind of dependence on the distribution 
vectors p(t) as in the standard bandit setting. 



Theorem 3. Suppose that T is sufficiently large (and
thus $\eta$ and $\beta$ sufficiently small) so that $(1 + \beta)^2 \le 2$.
Then for any $v \in [1, k]$, it holds with probability
at least $1 - \delta - P(v, \delta, \mu)$ that the sequence of
rewards $g_{i_1}(1), \ldots, g_{i_T}(T)$ returned by Algorithm 2
satisfies the following, simultaneously over all segmentations
of $\{1, \ldots, T\}$ to S epochs and a choice of action
$i_s$ to each epoch:

S T s+1 t 

X X XX (*) 

s=l t=T B +l t=l 

^W s (7 + " + ") T + f + ^)' 

The $\tilde{O}$ notation hides numerical constants and factors
logarithmic in $k$ and $\delta$.

In particular, we can also get a parallel version of 
Thm. 2, which shows that when there are only a small 
number of "good" actions (compared to k), the leading
term has decaying dependence on k, unlike standard
bandit algorithms where the dependence on k is always
$\sqrt{k}$.

Theorem 4. Suppose that the reward of each action
is chosen i.i.d. from a distribution supported on [0, 1].
Furthermore, suppose that at each epoch s, there exists
a subset $G_s \subset [k]$ of actions and a parameter $\Delta > 0$
(where $|G_s|, \Delta$ are considered constants independent of
$k, T$), such that the expected reward of any action in
$G_s$ is larger than the expected reward of any action in
$[k] \setminus G_s$ by at least $\Delta$. Then if we run Algorithm 2 with

$$\mu = k^{\min\{1,\ \max\{0,\ \frac{4}{3} - \frac{1}{3}\log_k(T)\}\}},$$

it holds with probability at least $1 - \delta$ that the regret of
the algorithm is at most

$$\tilde{O}\left(\sqrt{S\, k^{\max\{0,\ \frac{4}{3} - \frac{1}{3}\log_k(T)\}}\; T}\right),$$

where the $\tilde{O}$ notation hides numerical constants and
factors logarithmic in $\delta$ and $k$.

Now, we are ready to present the main negative result
of this section, which shows that in the setting of
Thm. 4, any standard bandit algorithm cannot have
a regret better than $\Omega(\sqrt{kT})$, which is significantly
worse. For simplicity, we will focus on the case where
S = 2: namely, that we measure regret with respect
to a single action from round 1 till some $t_0$, and then
from $t_0 + 1$ till T. Moreover, we consider a simple case
where $|G_1| = |G_2| = 1$ and $\Delta = 1/5$, so there is just
a single action at a time which is significantly better
than all the other actions in expectation.

Theorem 5. Suppose that T > Ck for some sufficiently
large universal constant C. Then in the setting
of Thm. 4, there exists a randomized reward assignment
(with $|G_1| = |G_2| = 1$ and $\Delta = 1/5$), such that
for any standard bandit algorithm, its expected regret
(over the rewards assignment and the algorithm's
randomness) is at least $0.007\sqrt{(k - 1)T}$.



The constant 0.007 is rather arbitrary and is not the 
tightest possible. 

We note that a related $\Omega(\sqrt{T})$ lower bound has been
obtained in (Garivier & Moulines, 2011). However,
their result does not apply to the case S = 2 and more
importantly, does not quantify a dependence on k. It
is interesting to note that unlike the standard lower
bound proof for standard bandits (Auer et al., 2002),
we obtain here an $\Omega(\sqrt{kT})$ regret even when $\Delta > 0$ is
fixed and doesn't decay with T.

5. The Necessity of a Non-Uniform 
Querying Distribution 

The theoretical results above demonstrated the effi- 
cacy of our approach, compared to standard bandit 
algorithms. However, the exact form of our querying 
distribution (querying action $j$ with probability proportional
to $\sqrt{p_j(t)}$) might still seem a bit mysterious.
For example, maybe one can obtain similar results just 
by querying actions uniformly at random? Indeed, this 
is what has been done in some other online learning 
scenarios where queries were allowed (e.g., (Yu & Man- 
nor, 2009; Agarwal et al., 2010)). However, we show 
below that in the adversarial setting, an adaptive and 
non-uniform querying distribution is indeed necessary 
to obtain regret bounds better than $\sqrt{kT}$. For simplicity,
we return to our basic setting, where our goal
is to compete with just the best single fixed action in 
hindsight. 

Theorem 6. Consider any online algorithm over k > 2
actions and horizon T, which queries the actions
based on a fixed distribution. Then there exists a strategy
for the adversary conforming to the setting described
in Thm. 2, for which the algorithm's regret is
at least $c\sqrt{kT}$ for some universal constant c.

A proof sketch is presented in the appendix of the full 
version. The intuition of the proof is that if the query- 
ing distribution is fixed, and there are only a small 
number of "good" actions, then we spend too much 
time querying irrelevant actions, and this hurts our 
regret performance. 

6. Experiments 

We compare the decoupled approach with common 
multi-armed bandit algorithms in a simulated adver- 
sarial setting. Our user chooses between k communica- 
tion channels, where sensing and transmission can be 
decoupled. In other words, she may choose a certain 
channel for transmission while sensing (i.e., querying) 
a different, seemingly less attractive, channel.

We simulate a heavily loaded UWB environment with
a single, alternating, channel which is fit for transmission.
The rewards of $k - 1$ channels are drawn from
alternating uniform and truncated Gaussian distribu- 
tions with random parameters, yielding adversarial re- 
wards in the range [0, 6]. The remaining channel yields 
stochastic rewards drawn from a truncated Gaussian 
distribution bounded in the same range but with a 
mean drawn from [3,6]. The identity of the better 
channel and its distribution parameters are re-drawn 
at exponentially distributed switching times.

Figure 1. Average reward for different algorithms over
time. Shaded areas around plots represent the standard 
deviation over repetitions. 

Figure 1 displays the results of a scenario with k = 10 
channels, comparing the average reward acquired by 
the different algorithms over T = 10,000 rounds. We
implemented Algorithm 1, Exp3 (Auer et al., 2002),
Exp3.P (Auer et al., 2002), a simple round robin policy
(which just cycles through the arms in a fixed order) 
and a "greedy" decoupled form of round robin, which 
performs uniform queries and picks actions greedily 
based on the highest empirical average reward. The 
black arrows indicate rounds in which the identity 
of the stochastic arm and its distribution parameters 
were re-drawn. The results are averaged over 50 rep- 
etitions of a specific realization of rewards. Although 
we have tested our algorithm's performance on several 
realizations of switching times and rewards with very 
good results, we display a single realization of these 
for the sake of clarity. 
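
As an illustration of the simplest decoupled baseline in this comparison, the "greedy" decoupled round robin can be sketched as follows. The text only says that it queries uniformly and exploits the arm with the highest empirical average, so the details here are our assumptions: queries cycle through the arms in a fixed order, unqueried arms start with an empirical mean of 0, and ties are broken in favor of the lowest index. The function name is hypothetical.

```python
def greedy_decoupled_round_robin(rewards):
    """Exploit the best empirical arm while querying arms in a fixed cycle.

    rewards[t][j] is the reward of arm j at round t; returns the chosen arms.
    """
    T, k = len(rewards), len(rewards[0])
    sums = [0.0] * k   # sum of queried rewards per arm
    counts = [0] * k   # number of queries per arm
    chosen = []
    for t in range(T):
        means = [sums[j] / counts[j] if counts[j] else 0.0 for j in range(k)]
        i_t = max(range(k), key=lambda j: means[j])  # greedy exploitation
        chosen.append(i_t)
        j_t = t % k                                  # round-robin query
        sums[j_t] += rewards[t][j_t]
        counts[j_t] += 1
    return chosen


# Arm 2 always pays 1; it is exploited as soon as it has been queried once.
demo = [[0.0, 0.0, 1.0] for _ in range(10)]
print(greedy_decoupled_round_robin(demo))
```

Since the empirical averages are taken over the whole history, such a policy reacts slowly when the identity of the better channel changes.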

Figure 2 displays the dynamics of channel selection for
two of the k = 10 channels. The thick plots represent
the number of times a channel was chosen over time,
and the thin plots represent the number of times it
was queried. The dashed plots represent a channel
which was drawn as the better channel during some
periods, resulting in a relatively high average reward,
while the solid plots represent a channel with a low
average reward. The increased flexibility of the decoupled
approach is evident from the graph, as well as the
adaptive, nonlinear sampling policy.

Figure 2. Number of times channels were chosen and
queried over time, for two of k = 10 arms. Arrows mark
times in which channel 2 was drawn as the better channel.

Comments: We implement Algorithm 1 and not Algorithm 2
since the number of switches is unknown
a priori. Also, the rewards are in the range [0, 6] in
order to keep all implemented algorithms on a similar 
scale, without violating the boundedness assumption. 

7. Discussion 

In this paper, we analyzed if and how one can benefit 
in settings where exploration and exploitation can be 
"decoupled:" namely, that one can query for rewards 
independently of the action actually picked. We devel- 
oped some algorithms for this setting, and showed that 
these can indeed lead to improved results, compared to 
the standard bandit setting, under certain conditions. 
We also performed some experiments that corroborate 
our theoretical findings. 

For simplicity, we focused on the case where only a 
single reward may be queried. If c > 1 queries are 
allowed, it is not hard to show parallel guarantees to 
those in this paper, where the dependence on k is re- 
placed by dependence on k/c. Algorithmically, one 
simply needs to repeatedly sample from the query dis- 
tribution c times, instead of a single time. We con- 
jecture that similar lower bounds can be obtained as 
well. Interestingly, it seems that being allowed to see 
the reward of the action actually picked, on top of 
the queried reward, does not result in significantly 
improved regret guarantees (other than better con- 
stants). 

Several open questions remain. First, our results do 
not apply when the rewards are chosen by an adaptive
adversary (namely, that the rewards are not fixed
in advance but may be chosen individually at each
round, based on the algorithm's behavior in previous
rounds). This is not just for technical reasons, but also
because data- and algorithm-dependent quantities like
$P(v, \delta, \mu)$ do not make much sense if the rewards are
not considered as fixed quantities.

A second open question concerns the possible correla- 
tion between sensing and exploration. In some appli- 
cations it is plausible that the choice of which arm to 
exploit affects the quality of the sample of the arm that 
is explored. For instance, in the UWB sensing example 
discussed in the introduction, transmitting and receiving
in the same channel is much less preferred than
sensing in another channel because of interference in 
the same frequency band. It would be interesting to 
model such dependence and take it into account in the 
learning process. 

Finally, it remains to extend other bandit-related al- 
gorithms, such as EXP4 (Auer et al., 2002), to our set- 
ting, and study the advantage of decoupling in other 
adversarial online learning problems. 
A. Appendix 

A.1. Proof of Thm. 1 

We begin by noticing that for any possible distribution $p_1(t), \ldots, p_k(t)$, it must hold that $\|p(t)\|_{1/2} \in [1, k]$. We 
will use this observation implicitly throughout the proof. 
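As a quick sanity check of this observation, the sketch below evaluates the 1/2-norm at the two extremes. It assumes the convention that the 1/2-norm of p is the square of the sum of the square roots of its entries; the helper name is hypothetical.

```python
import math


def half_norm(p):
    """(sum_j sqrt(p_j))^2, the assumed definition of the 1/2-norm."""
    return sum(math.sqrt(x) for x in p) ** 2


# A point mass attains the lower end of [1, k] ...
point_mass = half_norm([1.0, 0.0, 0.0, 0.0])
# ... and the uniform distribution over k = 4 actions attains the upper end.
uniform = half_norm([0.25, 0.25, 0.25, 0.25])
```

Any other distribution over four actions falls strictly between these two values.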

For notational simplicity, we will write $P(v)$ instead of $P(v, \delta, \mu)$, since we will mainly consider things as a 
function of $v$ where $\delta, \mu$ are fixed. 

We will need the following two lemmas. 

Lemma 1. Suppose that $\beta \le 1$. Then it holds with probability at least $1 - \delta$ that for any $i = 1, \ldots, k$, 

$$\sum_{t=1}^{T} \tilde{g}_i(t) \;\ge\; \sum_{t=1}^{T} g_i(t) - \frac{\log(k/\delta)}{\beta}.$$



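Proof. Let $E_t$ denote expectation with respect to the algorithm's randomness at round $t$, conditioned on the 
previous rounds. Since $\exp(x) \le 1 + x + x^2$ for $x \le 1$, we have by definition of $\tilde{g}_i(t)$ that 

$$E_t\left[\exp\left(\beta(g_i(t) - \tilde{g}_i(t))\right)\right] = E_t\left[\exp\left(\beta\left(g_i(t) - \frac{g_i(t) 1_{j_t = i}}{q_i(t)}\right)\right)\right] \exp\left(-\frac{\beta^2}{q_i(t)}\right)$$

$$\le \left(1 + \beta E_t\left[g_i(t) - \frac{g_i(t) 1_{j_t = i}}{q_i(t)}\right] + \beta^2 E_t\left[\left(g_i(t) - \frac{g_i(t) 1_{j_t = i}}{q_i(t)}\right)^2\right]\right) \exp\left(-\frac{\beta^2}{q_i(t)}\right)$$

$$\le \left(1 + \frac{\beta^2}{q_i(t)}\right) \exp\left(-\frac{\beta^2}{q_i(t)}\right).$$

Using the fact that $(1 + x)\exp(-x) \le 1$, we get that this expression is at most 1. As a result, we have 

$$E\left[\exp\left(\beta \sum_{t=1}^{T} (g_i(t) - \tilde{g}_i(t))\right)\right] \le 1.$$

Now, by a standard Chernoff technique, we know that 

$$\Pr\left(\sum_{t=1}^{T} (g_i(t) - \tilde{g}_i(t)) > \epsilon\right) \le \exp(-\beta\epsilon)\, E\left[\exp\left(\beta \sum_{t=1}^{T} (g_i(t) - \tilde{g}_i(t))\right)\right] \le \exp(-\beta\epsilon).$$

Substituting $\delta = \exp(-\beta\epsilon)$, solving for $\epsilon$, and using a union bound to make the result hold simultaneously for 
all $i$, the result follows. □ 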
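The two elementary inequalities used in this proof can be checked numerically. The sketch below is only a grid check under an arbitrary range and tolerance (both are assumptions of this illustration), not a proof.

```python
import math


def check_inequalities(lo=-10.0, hi=1.0, steps=1100, tol=1e-12):
    """Check exp(x) <= 1 + x + x^2 and (1 + x) exp(-x) <= 1 on a grid in [lo, hi]."""
    ok_quadratic = True
    ok_product = True
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        if math.exp(x) > 1 + x + x * x + tol:
            ok_quadratic = False
        if (1 + x) * math.exp(-x) > 1 + tol:
            ok_product = False
    return ok_quadratic, ok_product
```

The first inequality is only needed for x at most 1, which is why the grid stops there; it fails for larger x (for example at x = 2).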



We will also need the following straightforward corollary of Freedman's inequality (Freedman, 1975) (see also 
Lemma A.8 in (Cesa-Bianchi & Lugosi, 2006)). 

Lemma 2. Let $X_1, \ldots, X_T$ be a martingale difference sequence with respect to the filtration $\{\mathcal{F}_t\}_{t=1,\ldots,T}$, and with 
$|X_i| \le B$ almost surely for all $i$. Also, suppose that for some fixed $v > 0$ and confidence parameter $P(v) \in (0, 1)$, 
it holds that $\Pr\left(\sum_{t=1}^{T} E[X_t^2 \mid \mathcal{F}_{t-1}] > vT\right) \le P(v)$. Then for any $\delta \in (0,1)$, it holds with probability at least 
$1 - \delta - P(v)$ that 



T 

E 

t=i 
We can now turn to prove the main theorem. We define the potential function $W_t = \sum_{j=1}^{k} w_j(t)$, and get that 

$$\frac{W_{t+1}}{W_t} = \sum_{j=1}^{k} \frac{w_j(t)}{W_t} \exp\left(\eta \tilde{g}_j(t)\right). \qquad (2)$$



We have that $\eta \tilde{g}_j(t) \le 1$, since by definition of the various parameters, 

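Using the definition of $p_j(t)$ and the inequality $\exp(x) \le 1 + x + x^2$ for any $x \le 1$, we can upper bound Eq. (2) 
by 

$$\sum_{j=1}^{k} \frac{p_j(t) - \gamma/k}{1 - \gamma} \left(1 + \eta \tilde{g}_j(t) + \eta^2 \tilde{g}_j(t)^2\right) \le 1 + \frac{\eta}{1 - \gamma} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t) + \frac{\eta^2}{1 - \gamma} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t)^2.$$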

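Taking logarithms and using the fact that $\log(1 + x) \le x$, we get 

$$\log\left(\frac{W_{t+1}}{W_t}\right) \le \frac{\eta}{1 - \gamma} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t) + \frac{\eta^2}{1 - \gamma} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t)^2.$$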

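Summing over all $t$, and canceling the resulting telescopic series, we get 

$$\log\left(\frac{W_{T+1}}{W_1}\right) \le \frac{\eta}{1 - \gamma} \sum_{t=1}^{T} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t) + \frac{\eta^2}{1 - \gamma} \sum_{t=1}^{T} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t)^2. \qquad (3)$$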

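Also, for any fixed action $i$, we have 

$$\log\left(\frac{W_{T+1}}{W_1}\right) \ge \log\left(\frac{w_i(T+1)}{W_1}\right) = \eta \sum_{t=1}^{T} \tilde{g}_i(t) - \log(k). \qquad (4)$$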



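Combining Eq. (3) with Eq. (4) and slightly rearranging and simplifying, we get 

$$\sum_{t=1}^{T} \tilde{g}_i(t) - \frac{1}{1 - \gamma} \sum_{t=1}^{T} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t) \le \frac{\log(k)}{\eta} + \frac{\eta}{1 - \gamma} \sum_{t=1}^{T} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t)^2. \qquad (5)$$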

We now start to analyze the various terms in this expression. At several points in what follows, we will implicitly 
use the definition of $q_j(t)$ and the fact that $\|p(t)\|_{1/2} \in [1, k]$. 

Let $E_t$ denote expectation with respect to the randomness of the algorithm on round $t$, conditioned on the 
previous rounds. Also, let 

$$g'_j(t) = \frac{g_j(t)\, 1_{j_t = j}}{q_j(t)},$$

and note that $\tilde{g}_j(t) = g'_j(t) + \frac{\beta}{q_j(t)}$ and $E_t[g'_j(t)] = g_j(t)$. We have that 



$$\sum_{j=1}^{k} p_j(t) \tilde{g}_j(t) = \sum_{j=1}^{k} p_j(t) g'_j(t) + \beta \sum_{j=1}^{k} \frac{p_j(t)}{q_j(t)} = \sum_{j=1}^{k} p_j(t) g'_j(t) + \beta \|p(t)\|_{1/2}. \qquad (6)$$
Also, since $\sum_{j=1}^{k} p_j(t)(g'_j(t) - g_j(t))$ is a martingale difference sequence (indexed by $t$), it holds that 



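and 

$$E_t\left[\left(\sum_{j=1}^{k} p_j(t)(g'_j(t) - g_j(t))\right)^2\right] \le E_t\left[\left(\sum_{j=1}^{k} p_j(t) g'_j(t)\right)^2\right] = \sum_{j=1}^{k} \frac{p_j(t)^2 g_j(t)^2}{q_j(t)} \le \sqrt{\|p(t)\|_{1/2}} \sum_{j=1}^{k} p_j(t)^{3/2} \le \|p(t)\|_{1/2}.$$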

Therefore, applying Lemma 2, and using the assumptions stated in the theorem, it holds with probability at 
least $1 - \delta - P(v)$ that 



EEft(%'w <EEftW?w + \/ 21 °g (]) vT 



T log U 



(7) 



Moreover, we can apply Azuma's inequality with respect to the martingale difference sequence $\sum_{j=1}^{k} p_j(t) g_j(t) - g_{i_t}(t)$, 
indexed by $t$ (since $i_t$ is chosen with probability $p_{i_t}(t)$), and get that with probability at least $1 - \delta$, 

EEftWsW - 9h (*) <\\ log (]) T. 
i=l j=l V ^ ' 



(8) 



Combining Eq. (6), Eq. (7) and Eq. (8) with a union bound, and recalling that the event $\sum_{t=1}^{T} \|p(t)\|_{1/2} \le vT$ 
is assumed to hold with probability at least $1 - P(v)$, we get that with probability at least $1 - 2\delta - P(v)$, 



T k 



EE^Wfef')) -!>*(*) < P*r + \/21og ( ~ ) «T + W J log ( 1 ) T + ^ log ( 1 



t=l i=l 



(9) 



We now turn to analyze the term $\sum_{j=1}^{k} p_j(t) \tilde{g}_j(t)^2$, using substantially the same approach. We have that 



= U{t) 

3=1 3=1 V 



3=1 3=1 3=1 



We 



note that 53j=i P3 W — max j = ||p(*)||i/2 _• k. This implies that 



E E * 

t=i 



< 



T / T \ 2 

t=i \t=i / 



Applying Lemma 2, we get that with probability at least $1 - \delta - P(v)$, 

T k T 



EEwWtfw < E E * 
t=i j=i (=i 



3=1 



«T4/21og 



log 



Moreover, 



E, 



3 = 1 



^ E^w^i = Hp(*)IIi/2. 
so overall, we get that with probability at least $1 - \delta - P(v)$, 

' ' ' ' ' +2/3 2 k 2 . (10) 



£ !>(*)#(*) ^ 2vT + \/ 2l0g G)) +klog Q 



Combining Lemma 1, Eq. (9) and Eq. (10) with a union bound, substituting into Eq. (5), and somewhat 
simplifying, we get that with probability at least $1 - \delta - P(v)$, 



"fEftO-E^W - lT + 2vT {P + W61og(3/<S)) + 2V51og(3/<5)«T 



t=i 



log(3fc/<5) log(fc) 



where $\tilde{O}$ hides numerical constants and factors logarithmic in $\delta$. Substituting our choices of $\gamma$ and $\beta$, and again 
somewhat simplifying, we get the bound 



maxf>(i) -5>h(*) < n ( 8 ^ 61 °s(f) uT ) + J (Jh l0S {lf) +l0S{k) ) +2 ^ fc2T 

+ 2v/51og(3/5)uT + (!) (\/fc + f]k + fc 2 r/ 3 ) . 
Plugging in $\eta = 1/\sqrt{\mu T}$, we get the bound stated in the theorem. 

A.2. Proof of Thm. 2 

For notational simplicity, we will use the $\tilde{O}$-notation to hide both constants and second-order factors (as $T/k \to \infty$). 
Inspecting the proof of Thm. 1, it is easy to verify$^2$ that it implies that with probability at least $1 - \delta$ 

t t k / n~ 2 \ u2 p \ 

»«gft(*)-Egpj(*)ffi(*) < a (y(7+^ + w ) T +7 + ^J- 

Suppose w.l.o.g. action 1 is in G. Then it follows that 

' k 2 \ 



£5>(t)Mt) -*(*)) < d(J(^ + ^ + v^T+^ 



T 3/2 



This bound holds for any choice of rewards. Now, we note that each $g_j(t)$ is chosen i.i.d. (and independently 
of $p_j(t)$), and thus $\sum_{j=1}^{k} p_j(t)\left((g_1(t) - g_j(t)) - E[g_1(t) - g_j(t)]\right)$ is a martingale difference sequence. Applying 
Azuma's inequality, we get that with probability at least $1 - \delta$ over the choice of rewards, 

$$\sum_{t=1}^{T} \sum_{j=1}^{k} p_j(t)\left(g_1(t) - g_j(t)\right) \ge \sum_{t=1}^{T} \sum_{j=1}^{k} p_j(t)\, E[g_1(t) - g_j(t)] - \sqrt{2 \log(1/\delta) T}$$

$$\ge \sum_{t=1}^{T} \sum_{j \in [k] \setminus G} p_j(t) \Delta - \sqrt{2 \log(1/\delta) T}.$$

Thus, by a union bound, with probability at least $1 - 2\delta - P(v, \delta, \mu)$ over the randomness of the rewards and 
the algorithm, we get 

k 2 k 2 ' 



T 



t=l ]e[ k]\G 



$^2$ The difference from Thm. 1 is that the term $\sum_{t=1}^{T} g_{i_t}(t)$ is replaced by $\sum_{t=1}^{T} \sum_{j=1}^{k} p_j(t) g_j(t)$. In the proof, we transformed 
the latter to the former by a martingale argument, but we could have just left it there and achieved the same bound. 
where $\tilde{O}$ hides an inverse dependence on $\Delta$. Now, we relate the left hand side to $\frac{1}{T} \sum_{t=1}^{T} \|p(t)\|_{1/2}$. To do so, 
we note that for any vector $x$ with support of size $|G|$, it holds that $\|x\|_{1/2} \le |G| \|x\|_1$. Using this and the fact 
that $(a + b)^2 \le 2a^2 + 2b^2$, we have 





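$$\|p(t)\|_{1/2} = \left(\sum_{j \in G} \sqrt{p_j(t)} + \sum_{j \in [k] \setminus G} \sqrt{p_j(t)}\right)^2 \le 2\left(\sum_{j \in G} \sqrt{p_j(t)}\right)^2 + 2\left(\sum_{j \in [k] \setminus G} \sqrt{p_j(t)}\right)^2$$

$$\le 2|G| + 2k \sum_{j \in [k] \setminus G} p_j(t).$$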
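The support-size inequality used in this step is an instance of Cauchy-Schwarz. The sketch below checks it numerically on nonnegative vectors; it assumes the 1/2-norm is the squared sum of square roots, and the helper names are hypothetical.

```python
import math
import random


def half_norm(x):
    """(sum_j sqrt(x_j))^2 for a nonnegative vector x (assumed definition)."""
    return sum(math.sqrt(v) for v in x) ** 2


def support_bound_holds(x, tol=1e-12):
    """Check half_norm(x) <= (support size) * (sum of entries)."""
    support = sum(1 for v in x if v > 0)
    return half_norm(x) <= support * sum(x) + tol
```

Equality holds when all nonzero entries are equal, for example x = [0.25, 0.25, 0, 0].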

Plugging this back to Eq. (11), and recalling that $|G|$ is considered a constant independent of $k, T$, we get that 
with probability at least $1 - 2\delta - P(v, \delta, \mu)$, it holds that 

^EIIpWIIi/ 2 <o 

t=l 

Recall that this bound holds for any $v$. In particular, if we pick $v = k$, then $P(v, \delta, \mu) = 0$, and we get that with 
probability at least $1 - 2\delta$, 



t=l 

This gives us a high-probability bound, holding with probability at least $1 - 2\delta$, on $\frac{1}{T} \sum_{t=1}^{T} \|p(t)\|_{1/2}$. But this 
means that if we pick $v$ to equal the right hand side of Eq. (12), then by the very definition of $P(v, \delta, \mu)$, we get 
$P(v, \delta, \mu) = 2\delta$. Using this choice of $v$ and applying Thm. 1, it follows that with probability at least $1 - 4\delta$, the 
regret obtained by the algorithm is at most 

f ~ ( k 2 k 3 k 3 M 
where v = max jl , O + — + ^ J j . (13) 

Now, it remains to optimize over $\mu$ to get a final bound. As a sanity check, we note that when $\mu = k$ and $T \ge k$ 
we get 

fk 3 k 2 k 3 \ 





<0(k), 

and a regret bound of $\tilde{O}(\sqrt{kT})$, same as a standard bandit algorithm. On the other hand, when $\mu = 1$ and 
$T \ge \Omega(k^4)$, we get $v = \tilde{O}(1)$ and a regret bound of $\tilde{O}(\sqrt{T})$, which is much better. The caveat is that we need 
$T$ to be sufficiently large compared to $k$ in order to get this effect. To understand what happens in between, it 
will be useful to represent this bound a bit differently. Let $\alpha = \log_T(k) \in (0, 1]$, so that $k = T^{\alpha}$, and let $\mu = T^{\beta}$ 
(where we need to ensure that $\beta \in [0, \alpha]$, as $\mu \in [1, k]$). Then, a rather tedious but straightforward calculation 
shows that the regret bound above equals 

/ 1-/3 no o + 1 o 8 1+3 -, / n , 1-3 3a-3 3 3 o ; 

( J 1 2 -|- Jwa-P _)_ y3a J— _|_ y3a- 2 -2 _|_ ji— _|_ j.1/2 _|_ _|_ J 1 ^— _|_ J 1 !"^! -|- J^a- , 



Using the fact that /3 < a < 1, we can drop the Tt +T 1 / 2 terms, since it is always dominated by the Tt term 
in the expression. The same goes for the T^ a ~i + T 2a ~i terms, since they are dominated by the term (as 

/? < a < 1). This also holds for the term, which is dominated by the T 2 "~^ term, and the T 3a ~z~ 2 term, 

which is dominated by the T 2a ~P term. Thus, we now need to find the /3 minimizing the maximum exponent, 

Jo ft q 3 / ? + 1 l + P 
mm max < lot — p , da — , — - — , a + 



13 { ' ' 2 ' 2 ' 4 

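This expression can be shown to be optimized for $\beta = \frac{1}{3} \max\{0, 4\alpha - 1\}$, where it equals $\frac{1}{2} + \frac{1}{6} \max\{0, 4\alpha - 1\}$. 
Substituting back $\alpha = \log_T(k)$ (equivalently, $1/\alpha = \log_k(T)$), we get the regret bound 

$$\tilde{O}\left(T^{\frac{1}{2} + \frac{1}{6} \max\{0, 4\alpha - 1\}}\right) = \tilde{O}\left(\sqrt{T}\, T^{\frac{\alpha}{6} \max\{0, 4 - \frac{1}{\alpha}\}}\right) = \tilde{O}\left(\sqrt{T}\, k^{\frac{1}{6} \max\{0, 4 - \frac{1}{\alpha}\}}\right) = \tilde{O}\left(\sqrt{k^{\max\{0, \frac{4}{3} - \frac{1}{3} \log_k(T)\}}\, T}\right),$$

obtained using 

$$\mu = T^{\beta} = T^{\frac{1}{3} \max\{0, 4\alpha - 1\}} = T^{\frac{\alpha}{3} \max\{0, 4 - \frac{1}{\alpha}\}} = k^{\max\{0, \frac{4}{3} - \frac{1}{3} \log_k(T)\}}.$$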

The derivation above assumed that $\alpha \le 1$ (namely that $T \ge k$). For $T < k$, we need to clip $\mu$ to be at most $k$, 
and the regret bound obtained above is vacuous, as it is then larger than order of $\sqrt{kT} \ge T$. Thus, the bound 
we have obtained holds for any relation between $k, T$. 
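The choice of the exponent of $\mu$ can be sanity-checked numerically. The sketch below assumes that, after dropping dominated terms, the two binding exponents are $2\alpha - \beta$ and $(1 + \beta)/2$; this is an assumption of the illustration, and the function names are hypothetical. It compares a grid search over $\beta \in [0, \alpha]$ with the closed form $\beta = \frac{1}{3}\max\{0, 4\alpha - 1\}$.

```python
def best_beta(alpha, steps=10000):
    """Grid search for the beta in [0, alpha] minimizing the larger exponent."""
    best = None
    for i in range(steps + 1):
        beta = alpha * i / steps
        value = max(2 * alpha - beta, (1 + beta) / 2)
        if best is None or value < best[1]:
            best = (beta, value)
    return best


def closed_form(alpha):
    """Closed-form minimizer and the resulting regret exponent."""
    excess = max(0.0, 4 * alpha - 1)
    return excess / 3, 0.5 + excess / 6
```

For $\alpha \le 1/4$ (that is, $T \ge k^4$) both give $\beta = 0$ and exponent $1/2$; for $\alpha = 1$ both give $\beta = 1$ and exponent $1$.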



A.3. Proof of Thm. 3 

The proof is very similar to the one of Thm. 1, and we will therefore skip the derivation of some steps which are 
identical. 

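We define the potential function $W_t = \sum_{j=1}^{k} w_j(t)$, and get that 

$$\frac{W_{t+1}}{W_t} = \sum_{j=1}^{k} \frac{w_j(t)}{W_t} \exp\left(\eta \tilde{g}_j(t)\right) + e\alpha.$$

Using a similar derivation as in the proof of Thm. 1, we get 

$$\log\left(\frac{W_{t+1}}{W_t}\right) \le \frac{\eta}{1 - \gamma} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t) + \frac{\eta^2}{1 - \gamma} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t)^2 + e\alpha.$$

Summing over $t = T_s + 1, \ldots, T_{s+1}$, we get 

$$\log\left(\frac{W_{T_{s+1}+1}}{W_{T_s+1}}\right) \le \frac{\eta}{1 - \gamma} \sum_{t=T_s+1}^{T_{s+1}} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t) + \frac{\eta^2}{1 - \gamma} \sum_{t=T_s+1}^{T_{s+1}} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t)^2 + e\alpha (T_{s+1} - T_s). \qquad (14)$$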

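Now, for any fixed action $i_s$, we have 

$$w_{i_s}(T_{s+1} + 1) \ge w_{i_s}(T_s + 2) \exp\left(\eta \sum_{t=T_s+2}^{T_{s+1}} \tilde{g}_{i_s}(t)\right) \ge \frac{e\alpha}{k} W_{T_s+1} \exp\left(\eta \sum_{t=T_s+2}^{T_{s+1}} \tilde{g}_{i_s}(t)\right) \ge \frac{\alpha}{k} W_{T_s+1} \exp\left(\eta \sum_{t=T_s+1}^{T_{s+1}} \tilde{g}_{i_s}(t)\right),$$

where in the last step we used the fact that by our parameter choices, $\eta \tilde{g}_{i_s}(t) \le 1$ (see proof of Thm. 1). 
Therefore, we get that 

$$\log\left(\frac{W_{T_{s+1}+1}}{W_{T_s+1}}\right) \ge \log\left(\frac{w_{i_s}(T_{s+1}+1)}{W_{T_s+1}}\right) \ge \eta \sum_{t=T_s+1}^{T_{s+1}} \tilde{g}_{i_s}(t) + \log(\alpha/k). \qquad (15)$$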

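Combining Eq. (14) with Eq. (15) and slightly rearranging and simplifying, we get 

$$\sum_{t=T_s+1}^{T_{s+1}} \tilde{g}_{i_s}(t) - \frac{1}{1 - \gamma} \sum_{t=T_s+1}^{T_{s+1}} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t) \le \frac{\log(k/\alpha)}{\eta} + \frac{\eta}{1 - \gamma} \sum_{t=T_s+1}^{T_{s+1}} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t)^2 + \frac{e\alpha (T_{s+1} - T_s)}{\eta}.$$

Summing over all time periods $s$, we get overall 

$$\sum_{s=1}^{S} \sum_{t=T_s+1}^{T_{s+1}} \tilde{g}_{i_s}(t) - \frac{1}{1 - \gamma} \sum_{t=1}^{T} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t) \le \frac{S \log(k/\alpha)}{\eta} + \frac{\eta}{1 - \gamma} \sum_{t=1}^{T} \sum_{j=1}^{k} p_j(t) \tilde{g}_j(t)^2 + \frac{e\alpha T}{\eta}.$$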
In the proof of Thm. 1, we have already provided an analysis of these terms, which is not affected by the 
modification in the algorithm. Using this analysis, we end up with the following bound, holding with probability 
at least $1 - \delta - P(v, \delta, \mu)$: 

EE^w-Eft.w ^ » ( 8 \/ 61o g (t ) vT ) + ^Y4 log W) + Vfc2r 

+ 2v / 5 log(3/ 5)vT + l0g(fc/a) ^ + £aT + 6 (Vk + V k + kW) , 

where $\tilde{O}$ hides numerical constants and factors logarithmic in $\delta$. Plugging in $\alpha = 1/T$ and $\eta = \sqrt{S/(\mu T)}$, we get 
the bound stated in the theorem. 



A.4. Proof of Thm. 4 

The proof is almost identical to the one of Thm. 2, and we will only point out the differences. 
Starting in the same way, the analysis leads to the following bound: 



S T s+1 k 



s=l t=T s + l j=l 



E E E^h^w -»(*)) ^ 6 [\l s {j i + ^ +v ) T+k i + ^ 



This bound holds for any choice of rewards. Since each $g_j(t)$ is chosen i.i.d. (and independently of $p_j(t)$), we 
get that $\sum_{j=1}^{k} p_j(t)\left((g_{i_s}(t) - g_j(t)) - E[g_{i_s}(t) - g_j(t)]\right)$ is a martingale difference sequence. Applying Azuma's 
inequality, we get that with probability at least $1 - \delta$ over the choice of rewards, 

$$\sum_{s=1}^{S} \sum_{t=T_s+1}^{T_{s+1}} \sum_{j=1}^{k} p_j(t)\left(g_{i_s}(t) - g_j(t)\right) \ge \sum_{s=1}^{S} \sum_{t=T_s+1}^{T_{s+1}} \sum_{j=1}^{k} p_j(t)\, E[g_{i_s}(t) - g_j(t)] - \sqrt{2 \log(1/\delta) T}$$

$$\ge \sum_{s=1}^{S} \sum_{t=T_s+1}^{T_{s+1}} \sum_{j \in [k] \setminus G_s} p_j(t) \Delta - \sqrt{2 \log(1/\delta) T}.$$

Thus, by a union bound, with probability at least $1 - 2\delta - P(v, \delta, \mu)$ over the randomness of the rewards and 
the algorithm, we get 

E E E Pi(*) < o U(^ +fi+ v)T fc2 e A 



A* T 3 / 2 



As in the proof of Thm. 2, we use the inequality $\|p(t)\|_{1/2} \le 2|G_s| + 2k \sum_{j \in [k] \setminus G_s} p_j(t)$ and the assumption that 
$|G_s|$ is considered a constant independent of $k, T$, to get 



^EHpWIIi/ 2 <o^ 



fc 2 fc 3 fc 3 

+ 1-L + V — + — + 



(i ^ ) T iiT T 5 / 2 



The rest of the proof now follows verbatim the one of Thm. 2, with the only difference being the addition of the 
$S$ factor in the square root. 



A.5. Proof of Thm. 5 

Following standard lower-bound proofs for multi-armed bandits, we will focus on deterministic algorithms. We 
will show that there exists a randomized adversarial strategy, such that for any deterministic algorithm, the 
expected regret is lower bounded by $\Omega(\sqrt{kT})$. Since this bound holds for any deterministic algorithm, it also 
holds for randomized algorithms, which choose the action probabilistically (this is because any such algorithm 
can be seen as a randomization over deterministic algorithms). 

The proof is inspired by the lower bound result$^3$ of (Garivier & Moulines, 2011). We consider the following random 
adversary strategy. The adversary first fixes $\Delta = 1/5$. It then chooses an action $a \in \{2, \ldots, k\}$ uniformly at 
random, and a shift point $t_0 \in [T]$ with probability 

$$\Pr(t_0 = T) = \frac{1}{2} \quad \text{and} \quad \Pr(t_0 = t) = \frac{1}{2(T-1)} \;\; \forall t \ne T.$$

The adversary then randomly assigns i.i.d. rewards as follows (where we let $B(p)$ denote a Bernoulli distribution 
with parameter $p$, which takes a value of 1 with probability $p$ and 0 otherwise): 




In words, action 1 is the best action in expectation for the first $t_0$ rounds (all other actions being statistically 
identical), and then a randomly selected action $a$ becomes better. Also, with probability 1/2, we have $t_0 = T$, 
and then the distribution does not change at all. Note that both $t_0$ and $a$ are selected randomly and are not 
known to the learner in advance. 
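The random choices made by the adversary can be sketched as follows. The sketch covers only the draw of the action $a$ and the shift point $t_0$ (probability 1/2 on $t_0 = T$, with the remaining mass spread uniformly over the other $T - 1$ rounds); the reward assignment itself is not reproduced, and the function names are hypothetical.

```python
import random


def shift_point_probabilities(T):
    """Probability of each shift point t0 in {1, ..., T}; requires T >= 2."""
    probs = {t: 1.0 / (2 * (T - 1)) for t in range(1, T)}
    probs[T] = 0.5
    return probs


def sample_adversary(k, T, rng=random):
    """Draw the better action a in {2, ..., k} and the shift point t0."""
    a = rng.randrange(2, k + 1)
    if rng.random() < 0.5:
        t0 = T  # no distribution shift ever occurs
    else:
        t0 = rng.randrange(1, T)  # uniform over {1, ..., T - 1}
    return a, t0
```

Drawing $t_0 = T$ corresponds to the case where action 1 stays the best throughout.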

For the proof, we will need some notation. We let $E[\cdot]$ denote expectation with respect to the random adversary 
strategy mentioned above. Also, we let $E^a_{t_0}[\cdot]$ denote expectation over the adversary strategy, conditioned on the 
adversary picking action $a \in [k]$ and shift point $t_0 \in \{1, \ldots, T\}$. In particular, we let $E_T$ denote expectation 
over the adversary strategy, conditioned on the adversary picking $t_0 = T$ (which by definition of $t_0$, implies that 
the reward distribution remains the same across all rounds, and the additional choice of the action $a$ does not 
matter). Finally, define the random variable $N^a_t$ to be the number of times the algorithm chooses action $a$, in 
the time window $\{t, t + 1, \ldots, \min\{T, t + d\lceil\sqrt{T}\rceil\}\}$, where $d$ is a positive integer to be determined later. 

Let us fix some $t_0 < T$ and some action $a > 1$. Let $\mathcal{P}^a_{t_0}$ denote the probability distribution over the sequence 
of $d\lceil\sqrt{T}\rceil$ rewards observed by the algorithm at time steps $t_0 + 1, \ldots, t_0 + d\lceil\sqrt{T}\rceil$, conditioned on the adversary 
picking action $a$ and shift point $t_0$. Also, let $\mathcal{P}_T$ denote the probability distribution over such a sequence, 
conditioned on the adversary picking $t_0 = T$ and no distribution shift occurring. Then we have the following 
bound on the Kullback-Leibler divergence between the two distributions: 

AwtolTO = E D kl(n(9i t (t) I ft to (*o),---,fl it _ 1 (t-i)) II ^ Q (-K(io),---,ffH- 1 (t-i))) 
t=to 



1 + 2A 



= E [JVg] 2Alog f j < 2AE [N? ] 



Using a standard information-theoretic argument, based on Pinsker's inequality (see (Garivier & Moulines, 2011), 
as well as Theorem 5.1 in (Auer et al., 2002)), we have that for any function $f(r)$ of the reward sequence $r$, 
whose range is at most $[0, b]$, it holds that 



E£[/(r)] -E [/(r)]] < b\/~D kl {Vq\\V? 



3 This result also lower bounds the achievable regret in a setting quite similar to ours. However, the construction is 
different and more importantly, it does not quantify the dependence on k, the number of actions. 
In particular, applying this to $N^a_{t_0}$, we get 

E? o [N*] < E T [N t a ] + d\Vf] \JaEt M. 
Averaging over all $t_0 \in \{1, \ldots, T - d\lceil\sqrt{T}\rceil\}$, $a \in \{2, \ldots, k\}$ and applying Jensen's inequality, we get 

E 



^T-d[VT] 

(k-l)(T-d\VT\) 



■r^fc -d\V-l \ V a [Aral E T 

2^a=2 l^t a = l ^t L JV *oJ < 



v fc ^T-dTv^l N a 

(k-i)(T-d\VT]) 



-d\VT] 



AE 7 



^k ^T-d[VT] Ma 
Z^a=2 Z^t = l t 



\ (k-l)(T-d\VTl 



Now, let $N^{>1}$ denote the total number of times the algorithm chooses an action in $\{2, \ldots, k\}$. It is easily seen 
that 

$$\sum_{a=2}^{k} \sum_{t=1}^{T - d\lceil\sqrt{T}\rceil} N^a_t \le d\lceil\sqrt{T}\rceil N^{>1},$$

because on the left hand side we count every single choice of an action $> 1$ at most $d\lceil\sqrt{T}\rceil$ times. Plugging it 
back and slightly simplifying, we get 



(k-l)(T-d\VT]) 



< 



d\VT 



(k-l)(T-dWT]) 



E T [iV >1 ] 



d 3 A\VT] 



{k-i)(r-d\vr\) 



E T [iV> 1 ] 



(16) 



The left hand side of the expression above can be interpreted as the expected number of pulls of the best action 
in the time window $\{t_0, \ldots, t_0 + d\lceil\sqrt{T}\rceil\}$, conditioned on the adversary choosing $t_0 \le T - d\lceil\sqrt{T}\rceil$. Also, $\Delta E_T[N^{>1}]$ 
is clearly a lower bound on the regret, if the adversary chose $t_0 = T$ and action 1 remains the best throughout 
all rounds. Thus, denoting the regret by $R$, we have 

E[R] > Pr (t Q <T-d\VT]) E \r t Q < T - d\VT] \ + Pr(i = T)E[R\t = T] 



T-d\VTl 

> — - — ^-^E 
~ 2(T-1) 



R 



t <T-d[Vr] 



-E T [R] 



> 



2(T- 1) 



(k-i)(T-d[VT]) 



-E T [7V 



>ii 



We now choose $d = \sqrt{k - 1}/10$, plug in $\Delta = 1/5$ and Eq. (16), and make the following simplifying assumptions 
(which are justified by picking the constant $C$ in the theorem to be large enough): 



T-d\Vr^ > 4 



d\y/T\ 



< 



6 



[VT] 2 6 r- 

1 V 1 < -y/T. 



T-l - 5 ' (k-l){T-d\y/T\) ~ 5(k-l)y/T 25^/(k - 1)T ' VT ~ 5 
Performing the calculation, we get the following regret lower bound: 

6 



250 



10 625v/(fc- 1)T 



E T [N 



>ii 



3125 



^2(V(fe-l)T) E T [7V>i] 



and lower bounding the $\sqrt{(k-1)T}$ in the middle term by 1, we can further lower bound the expression by 

3 -^2(V(fc-l)r) E T [i\P _ 



— J(k-l)T+ —EtIN^] 
250 v v ; 1250 1 1 3125 



How small can this expression be as a function of $E_T[N^{>1}]$? It is easy to verify that the minimum of any function 
$f(x) = wx - \sqrt{vx}$ is attained for $x = v/(4w^2)$, with a value of $-v/(4w)$. Plugging in this value (for the appropriate 
choice of $v, w$) and simplifying, the result stated in the theorem follows. 
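The closed-form minimizer used in this last step can be checked numerically. The sketch below is a small illustration with arbitrary positive values of $w$ and $v$; the helper names are hypothetical.

```python
import math


def f(x, w, v):
    """The function w*x - sqrt(v*x) for x >= 0."""
    return w * x - math.sqrt(v * x)


def minimizer(w, v):
    """Closed-form minimizer x = v / (4 w^2)."""
    return v / (4 * w * w)


def min_value(w, v):
    """Closed-form minimum value -v / (4 w)."""
    return -v / (4 * w)
```

Setting the derivative $w - \sqrt{v}/(2\sqrt{x})$ to zero gives $\sqrt{x} = \sqrt{v}/(2w)$, which is where the closed form comes from.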
A.6. Proof Sketch of Thm. 6 

The proof idea is a reduction to the problem of distinguishing biased coins. In particular, suppose we have two 
Bernoulli random variables $X, Y$, one of which has a parameter $\frac{1}{2}$ and one of which has a parameter $\frac{1}{2} + \epsilon$. It 
is well-known that for some universal constant $c$, one cannot succeed in distinguishing the two, with a fixed 
probability, using only at most $c/\epsilon^2$ samples from each. 

We begin by noticing that for the fixed distribution $(p_1, \ldots, p_k)$, there must be two actions each of whose 
probabilities is at most $1/(k-1)$ (otherwise, there are at least $k - 1$ actions whose probabilities are larger than 
$1/(k-1)$, which is impossible). Without loss of generality, suppose these are actions 1, 2. We construct a bandit 
problem where the reward of action 1 is sampled i.i.d. according to a Bernoulli distribution with parameter $\frac{1}{2}$, 
and the reward of action 2 is sampled i.i.d. according to a Bernoulli distribution with parameter $\frac{1}{2} + \epsilon$, where 
$\epsilon = \sqrt{c'k/T}$ for some sufficiently small $c'$. The rest of the actions receive a deterministic reward of 0. Note that 
this setting corresponds to the one of Thm. 2, with $\Delta = 1/2$. We now run this algorithm for $T = c'k/\epsilon^2$ rounds. 
By picking $c'$ small enough, we can guarantee that with overwhelming probability, the algorithm samples actions 
1, 2 less than $c/\epsilon^2$ times. By the information-theoretic lower bound, this implies that the algorithm must have 
chosen a suboptimal action for at least $\Omega(T)$ times with constant probability. Therefore, the expected regret is 
at least $\Omega(\epsilon T)$, which equals $\Omega(\sqrt{kT})$ by our choice of $\epsilon$. 
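The counting step at the start of this sketch (two actions with probability at most $1/(k-1)$ always exist) can be illustrated as follows. The helper name is hypothetical, and the check below is only an example, not part of the proof.

```python
def two_least_likely(p):
    """Indices of the two actions with the smallest probabilities under p."""
    order = sorted(range(len(p)), key=lambda j: p[j])
    return order[0], order[1]


def both_below_threshold(p, tol=1e-12):
    """Check that the two least likely actions have probability <= 1/(k-1)."""
    k = len(p)
    i, j = two_least_likely(p)
    return p[i] <= 1.0 / (k - 1) + tol and p[j] <= 1.0 / (k - 1) + tol
```

The threshold is tight: with p = [0, 1/3, 1/3, 1/3] and k = 4, the second smallest probability is exactly $1/(k-1)$.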



